Name: N S Ramanujam Mangena

We were given a dataset indicating which customers availed a term deposit in the previous campaign. Before starting the analysis, as with every data science project, let us establish a basic understanding of the problem statement.

The initial inference is that only 11.7% of the customers availed a term deposit, which suggests the data is highly imbalanced. Our aim is to build a model that predicts who will actually take a term deposit. False negatives are therefore the costly errors, so we concentrate on recall to measure the model's performance.

Moreover, this is a classification problem, as our target column 'Target' is categorical in nature: 1/0 (True/False). We use ensemble methods to build the predictive models.
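To make the recall focus concrete, here is a minimal sketch with synthetic labels (not the bank data) showing how recall relates to false negatives:

```python
from sklearn.metrics import confusion_matrix, recall_score

# Synthetic labels: 1 = availed a term deposit, 0 = did not.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
# Recall = TP / (TP + FN): the share of actual subscribers the model catches.
rec = recall_score(y_true, y_pred)
print(tp, fn, rec)  # 2 1 0.6666666666666666
```

A model that misses real subscribers (false negatives) directly lowers this score, which is why we optimise for it on an imbalanced target.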

In [1]:
#import basic libraries
import pandas as pd  
import numpy as np  
import matplotlib.pyplot as plt  
import seaborn as sns 
%matplotlib inline
sns.set(style="whitegrid")

import warnings
warnings.filterwarnings("ignore")
In [132]:
bank_df = pd.read_csv("bank-full.csv")
bank_df.shape
Out[132]:
(45211, 17)
In [133]:
bank_df2 = bank_df.copy()
In [18]:
bank_df.head()
Out[18]:
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome Target
0 58 management married tertiary no 2143 yes no unknown 5 may 261 1 -1 0 unknown no
1 44 technician single secondary no 29 yes no unknown 5 may 151 1 -1 0 unknown no
2 33 entrepreneur married secondary no 2 yes yes unknown 5 may 76 1 -1 0 unknown no
3 47 blue-collar married unknown no 1506 yes no unknown 5 may 92 1 -1 0 unknown no
4 33 unknown single unknown no 1 no no unknown 5 may 198 1 -1 0 unknown no
In [72]:
sum(bank_df.duplicated())
Out[72]:
0

This suggests there are no duplicate rows.

In [122]:
bank_df.isnull().sum()
Out[122]:
age          0
job          0
marital      0
education    0
default      0
balance      0
housing      0
loan         0
contact      0
day          0
month        0
campaign     0
pdays        0
previous     0
poutcome     0
Target       0
dtype: int64
In [123]:
bank_df.isna().sum()
Out[123]:
age          0
job          0
marital      0
education    0
default      0
balance      0
housing      0
loan         0
contact      0
day          0
month        0
campaign     0
pdays        0
previous     0
poutcome     0
Target       0
dtype: int64

This suggests there are no null or NaN values in the dataset.

In [5]:
bank_df.describe()
Out[5]:
age balance day duration campaign pdays previous
count 45211.000000 45211.000000 45211.000000 45211.000000 45211.000000 45211.000000 45211.000000
mean 40.936210 1362.272058 15.806419 258.163080 2.763841 40.197828 0.580323
std 10.618762 3044.765829 8.322476 257.527812 3.098021 100.128746 2.303441
min 18.000000 -8019.000000 1.000000 0.000000 1.000000 -1.000000 0.000000
25% 33.000000 72.000000 8.000000 103.000000 1.000000 -1.000000 0.000000
50% 39.000000 448.000000 16.000000 180.000000 2.000000 -1.000000 0.000000
75% 48.000000 1428.000000 21.000000 319.000000 3.000000 -1.000000 0.000000
max 95.000000 102127.000000 31.000000 4918.000000 63.000000 871.000000 275.000000
  • The five-point summary suggests that the data is heavily skewed in all the continuous variables.
  • pdays has many records with -1, which does not seem right.
  • Many accounts have negative balances; these could be overdraft accounts.
In [134]:
bank_df.skew().sort_values()
Out[134]:
day          0.093079
age          0.684818
pdays        2.615715
duration     3.144318
campaign     4.898650
balance      8.360308
previous    41.846454
dtype: float64

The previous column is by far the most skewed.

In [135]:
bank_df[bank_df.pdays < 0].pdays.count()
Out[135]:
36954

As these records are very high in number, we take the absolute value of the pdays column instead of dropping them.

In [136]:
bank_df[bank_df.balance < 0].balance.count()
Out[136]:
3766

We have 3766 customers who have used an overdraft facility.

In [25]:
bank_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 17 columns):
age          45211 non-null int64
job          45211 non-null int32
marital      45211 non-null int32
education    45211 non-null int32
default      45211 non-null int32
balance      45211 non-null int64
housing      45211 non-null int32
loan         45211 non-null int32
contact      45211 non-null int32
day          45211 non-null int64
month        45211 non-null int32
duration     45211 non-null int64
campaign     45211 non-null int64
pdays        45211 non-null int64
previous     45211 non-null int64
poutcome     45211 non-null int32
Target       45211 non-null int32
dtypes: int32(10), int64(7)
memory usage: 4.1 MB

In the raw data, many columns have the object datatype, which cannot be used for model building directly. We will later use scikit-learn's LabelEncoder to convert them to a machine-readable numeric format.

In [137]:
bank_df.drop(['duration'],axis =1,inplace = True)

This column is dropped as it would not be useful in building prediction models; the problem statement also advised us to ignore it, since the call duration is only known after the outcome and could lead to an inappropriate model.

  • Replacing negative pdays values with absolute values
In [138]:
bank_df['pdays'] = bank_df['pdays'].abs()
  • Displaying the categorical columns summary and how data is distributed for a quick reference.
In [139]:
#Drop duplicate records
bank_df.drop_duplicates(inplace =True)
In [140]:
sum(bank_df.duplicated())
Out[140]:
0
In [21]:
bank_df2[['education','job','marital','default','housing','loan','contact','month','poutcome','Target']].apply(lambda x: x.value_counts()).T.stack()
Out[21]:
education  primary           6851.0
           secondary        23202.0
           tertiary         13301.0
           unknown           1857.0
job        admin.            5171.0
           blue-collar       9732.0
           entrepreneur      1487.0
           housemaid         1240.0
           management        9458.0
           retired           2264.0
           self-employed     1579.0
           services          4154.0
           student            938.0
           technician        7597.0
           unemployed        1303.0
           unknown            288.0
marital    divorced          5207.0
           married          27214.0
           single           12790.0
default    no               44396.0
           yes                815.0
housing    no               20081.0
           yes              25130.0
loan       no               37967.0
           yes               7244.0
contact    cellular         29285.0
           telephone         2906.0
           unknown          13020.0
month      apr               2932.0
           aug               6247.0
           dec                214.0
           feb               2649.0
           jan               1403.0
           jul               6895.0
           jun               5341.0
           mar                477.0
           may              13766.0
           nov               3970.0
           oct                738.0
           sep                579.0
poutcome   failure           4901.0
           other             1840.0
           success           1511.0
           unknown          36959.0
Target     no               39922.0
           yes               5289.0
dtype: float64

Note:

As 'unknown' values are given as a category in the problem statement, we treat them as one more level of each variable. They are not equivalent to NaN, where the data would actually be missing.

Label Encoding

We will replace the columns listed above with a machine-learning-ready format. Here we use the label-encoding technique so that we do not create new columns (as one-hot encoding would). The object columns are thereby converted to integers.
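As a quick toy illustration (separate from the notebook's data): LabelEncoder sorts the categories alphabetically before numbering them, so for example 'may' does not necessarily map to 5 in the month column.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(['yes', 'no', 'unknown', 'no'])
# Classes are sorted alphabetically before being numbered 0..k-1,
# so 'no' -> 0, 'unknown' -> 1, 'yes' -> 2.
print(le.classes_.tolist())  # ['no', 'unknown', 'yes']
print(codes.tolist())        # [2, 0, 1, 0]
```

Keeping the `classes_` mapping in mind helps when reading the encoded value counts later in the profiling output.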

In [141]:
# LabelEncoder to change the categorical values to numerical values.
from sklearn.preprocessing import LabelEncoder
from sklearn import preprocessing
In [142]:
le = preprocessing.LabelEncoder()
# Encode every categorical column in place with the same encoder object.
for col in ['job', 'marital', 'education', 'default', 'housing',
            'loan', 'contact', 'month', 'poutcome', 'Target']:
    bank_df[col] = le.fit_transform(bank_df[col])
bank_df.head()
bank_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 45195 entries, 0 to 45210
Data columns (total 16 columns):
age          45195 non-null int64
job          45195 non-null int32
marital      45195 non-null int32
education    45195 non-null int32
default      45195 non-null int32
balance      45195 non-null int64
housing      45195 non-null int32
loan         45195 non-null int32
contact      45195 non-null int32
day          45195 non-null int64
month        45195 non-null int32
campaign     45195 non-null int64
pdays        45195 non-null int64
previous     45195 non-null int64
poutcome     45195 non-null int32
Target       45195 non-null int32
dtypes: int32(10), int64(6)
memory usage: 4.1 MB

Univariate/Bivariate Analysis

It is clear that the following are the continuous variables in the given dataset:
  • Age
  • Balance
  • Last contacted day
  • Campaign
  • pdays - though numeric, its uneven distribution makes it hard to plot meaningfully.

The rest of the variables are categorical/nominal/binary. We can perform univariate analysis on these parameters.

In [22]:
plt.figure(figsize=(30,5))
# subplot 1
plt.subplot(1, 5, 1)
plt.title('Customer Age Distribution')
sns.distplot(bank_df.age,color='Red')

# subplot 2
plt.subplot(1, 5, 2)
plt.title('Customer balance')
sns.distplot(bank_df.balance,color='green')

# subplot 3
plt.subplot(1, 5, 3)
plt.title('Customer last contacted')
sns.distplot(bank_df.day,color='blue')

# subplot 4 (duration was dropped, so it is commented out)
#plt.subplot(1, 5, 4)
#plt.title('Customer last conversation duration')
#sns.distplot(bank_df.duration,color='Orange')

# subplot 5
plt.subplot(1, 5, 4)
plt.title('Customer campaign Distribution')
sns.distplot(bank_df.campaign,color='brown')


plt.figure(figsize=(30,5))
# subplot 1
plt.subplot(1, 5, 1)
plt.title('Customer Age Distribution')
sns.boxplot(bank_df.age,orient='vertical',color='Red')

# subplot 2
plt.subplot(1, 5, 2)
plt.title('Customer balance')
sns.boxplot(bank_df.balance,orient='vertical',color='green')

# subplot 3
plt.subplot(1, 5, 3)
plt.title('Customer last contacted')
sns.boxplot(bank_df.day,orient='vertical',color='blue')

# subplot 4 (duration was dropped, so it is commented out)
#plt.subplot(1, 5, 4)
#plt.title('Customer last conversation duration')
#sns.boxplot(bank_df.duration,orient='vertical',color='Orange')

# subplot 5
plt.subplot(1, 5, 4)
plt.title('Customer campaign distribution')
sns.boxplot(bank_df.campaign,orient='vertical',color='brown')
Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x1b923692080>

The data is highly skewed and imbalanced. Further inferences will follow after profiling.
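As an aside, a log transform is one common way to tame such right-skewed, non-negative counts (e.g. campaign or previous). A minimal sketch on synthetic data, not applied to bank_df here:

```python
import numpy as np
from scipy.stats import skew

# Hypothetical illustration: exponential draws mimic a right-skewed count.
rng = np.random.default_rng(0)
counts = rng.exponential(scale=3.0, size=10_000)

raw_skew = skew(counts)            # strongly right-skewed (around 2)
log_skew = skew(np.log1p(counts))  # much closer to symmetric
print(raw_skew, log_skew)
```

Tree-based ensembles are fairly robust to skew, which is one reason we do not transform the features in this notebook.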

Alternatively, the pandas-profiling library can be used to analyse each and every attribute in the given dataset in one pass.

In [143]:
import pandas_profiling
from pandas_profiling import ProfileReport
In [144]:
ProfileReport(bank_df)
Out[144]:

Overview

Dataset info

Number of variables 17
Number of observations 45195
Total Missing (%) 0.0%
Total size in memory 4.1 MiB
Average record size in memory 96.0 B

Variables types

Numeric 13
Categorical 0
Boolean 4
Date 0
Text (Unique) 0
Rejected 0
Unsupported 0

Warnings

  • balance has 3498 / 7.7% zeros Zeros
  • contact has 29271 / 64.8% zeros Zeros
  • education has 6848 / 15.2% zeros Zeros
  • job has 5171 / 11.4% zeros Zeros
  • marital has 5207 / 11.5% zeros Zeros
  • month has 2932 / 6.5% zeros Zeros
  • poutcome has 4901 / 10.8% zeros Zeros
  • previous is highly skewed (γ1 = 41.84) Skewed
  • previous has 36938 / 81.7% zeros Zeros

Variables

Target
Boolean

Distinct count 2
Unique (%) 0.0%
Missing (%) 0.0%
Missing (n) 0
Mean 0.11703

Value Count Frequency (%)
0 39906 88.3%
1 5289 11.7%

age
Numeric

Distinct count 77
Unique (%) 0.2%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 40.938
Minimum 18
Maximum 95
Zeros (%) 0.0%

Quantile statistics

Minimum 18
5-th percentile 27
Q1 33
Median 39
Q3 48
95-th percentile 59
Maximum 95
Range 77
Interquartile range 15

Descriptive statistics

Standard deviation 10.619
Coef of variation 0.2594
Kurtosis 0.31953
Mean 40.938
MAD 8.7374
Skewness 0.68468
Sum 1850175
Variance 112.77
Memory size 353.2 KiB
Value Count Frequency (%)  
32 2082 4.6%
 
31 1995 4.4%
 
33 1971 4.4%
 
34 1928 4.3%
 
35 1893 4.2%
 
36 1806 4.0%
 
30 1755 3.9%
 
37 1696 3.8%
 
39 1487 3.3%
 
38 1466 3.2%
 
Other values (67) 27116 60.0%
 

Minimum 5 values

Value Count Frequency (%)  
18 12 0.0%
 
19 35 0.1%
 
20 50 0.1%
 
21 79 0.2%
 
22 129 0.3%
 

Maximum 5 values

Value Count Frequency (%)  
90 2 0.0%
 
92 2 0.0%
 
93 2 0.0%
 
94 1 0.0%
 
95 2 0.0%
 

balance
Numeric

Distinct count 7168
Unique (%) 15.9%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 1362.8
Minimum -8019
Maximum 102127
Zeros (%) 7.7%

Quantile statistics

Minimum -8019
5-th percentile -172
Q1 72
Median 449
Q3 1428
95-th percentile 5768.3
Maximum 102127
Range 110146
Interquartile range 1356

Descriptive statistics

Standard deviation 3045.2
Coef of variation 2.2346
Kurtosis 140.72
Mean 1362.8
MAD 1551.8
Skewness 8.3593
Sum 61589682
Variance 9273200
Memory size 353.2 KiB
Value Count Frequency (%)  
0 3498 7.7%
 
1 195 0.4%
 
2 156 0.3%
 
4 139 0.3%
 
3 134 0.3%
 
5 113 0.3%
 
6 88 0.2%
 
8 81 0.2%
 
23 75 0.2%
 
10 69 0.2%
 
Other values (7158) 40647 89.9%
 

Minimum 5 values

Value Count Frequency (%)  
-8019 1 0.0%
 
-6847 1 0.0%
 
-4057 1 0.0%
 
-3372 1 0.0%
 
-3313 1 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
66721 1 0.0%
 
71188 1 0.0%
 
81204 2 0.0%
 
98417 1 0.0%
 
102127 1 0.0%
 

campaign
Numeric

Distinct count 48
Unique (%) 0.1%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 2.764
Minimum 1
Maximum 63
Zeros (%) 0.0%

Quantile statistics

Minimum 1
5-th percentile 1
Q1 1
Median 2
Q3 3
95-th percentile 8
Maximum 63
Range 62
Interquartile range 2

Descriptive statistics

Standard deviation 3.0983
Coef of variation 1.121
Kurtosis 39.248
Mean 2.764
MAD 1.7916
Skewness 4.8986
Sum 124918
Variance 9.5995
Memory size 353.2 KiB
Value Count Frequency (%)  
1 17539 38.8%
 
2 12497 27.7%
 
3 5520 12.2%
 
4 3521 7.8%
 
5 1764 3.9%
 
6 1291 2.9%
 
7 735 1.6%
 
8 540 1.2%
 
9 327 0.7%
 
10 265 0.6%
 
Other values (38) 1196 2.6%
 

Minimum 5 values

Value Count Frequency (%)  
1 17539 38.8%
 
2 12497 27.7%
 
3 5520 12.2%
 
4 3521 7.8%
 
5 1764 3.9%
 

Maximum 5 values

Value Count Frequency (%)  
50 2 0.0%
 
51 1 0.0%
 
55 1 0.0%
 
58 1 0.0%
 
63 1 0.0%
 

contact
Numeric

Distinct count 3
Unique (%) 0.0%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 0.64038
Minimum 0
Maximum 2
Zeros (%) 64.8%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 0
Q3 2
95-th percentile 2
Maximum 2
Range 2
Interquartile range 2

Descriptive statistics

Standard deviation 0.89799
Coef of variation 1.4023
Kurtosis -1.3171
Mean 0.64038
MAD 0.8295
Skewness 0.76904
Sum 28942
Variance 0.80639
Memory size 353.2 KiB
Value Count Frequency (%)  
0 29271 64.8%
 
2 13018 28.8%
 
1 2906 6.4%
 

Minimum 5 values

Value Count Frequency (%)  
0 29271 64.8%
 
1 2906 6.4%
 
2 13018 28.8%
 

Maximum 5 values

Value Count Frequency (%)  
0 29271 64.8%
 
1 2906 6.4%
 
2 13018 28.8%
 

day
Numeric

Distinct count 31
Unique (%) 0.1%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 15.805
Minimum 1
Maximum 31
Zeros (%) 0.0%

Quantile statistics

Minimum 1
5-th percentile 3
Q1 8
Median 16
Q3 21
95-th percentile 29
Maximum 31
Range 30
Interquartile range 13

Descriptive statistics

Standard deviation 8.3228
Coef of variation 0.5266
Kurtosis -1.0598
Mean 15.805
MAD 7.0561
Skewness 0.093451
Sum 714299
Variance 69.269
Memory size 353.2 KiB
Value Count Frequency (%)  
20 2752 6.1%
 
18 2308 5.1%
 
21 2021 4.5%
 
17 1939 4.3%
 
6 1932 4.3%
 
5 1910 4.2%
 
14 1847 4.1%
 
8 1842 4.1%
 
28 1829 4.0%
 
7 1816 4.0%
 
Other values (21) 24999 55.3%
 

Minimum 5 values

Value Count Frequency (%)  
1 322 0.7%
 
2 1293 2.9%
 
3 1079 2.4%
 
4 1445 3.2%
 
5 1910 4.2%
 

Maximum 5 values

Value Count Frequency (%)  
27 1121 2.5%
 
28 1829 4.0%
 
29 1744 3.9%
 
30 1566 3.5%
 
31 643 1.4%
 

default
Boolean

Distinct count 2
Unique (%) 0.0%
Missing (%) 0.0%
Missing (n) 0
Mean 0.018033

Value Count Frequency (%)
0 44380 98.2%
1 815 1.8%

education
Numeric

Distinct count 4
Unique (%) 0.0%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 1.2247
Minimum 0
Maximum 3
Zeros (%) 15.2%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 1
Median 1
Q3 2
95-th percentile 2
Maximum 3
Range 3
Interquartile range 1

Descriptive statistics

Standard deviation 0.74797
Coef of variation 0.61072
Kurtosis -0.26314
Mean 1.2247
MAD 0.60187
Skewness 0.19771
Sum 55352
Variance 0.55946
Memory size 353.2 KiB
Value Count Frequency (%)  
1 23199 51.3%
 
2 13291 29.4%
 
0 6848 15.2%
 
3 1857 4.1%
 

Minimum 5 values

Value Count Frequency (%)  
0 6848 15.2%
 
1 23199 51.3%
 
2 13291 29.4%
 
3 1857 4.1%
 

Maximum 5 values

Value Count Frequency (%)  
0 6848 15.2%
 
1 23199 51.3%
 
2 13291 29.4%
 
3 1857 4.1%
 

housing
Boolean

Distinct count 2
Unique (%) 0.0%
Missing (%) 0.0%
Missing (n) 0
Mean 0.55595

Value Count Frequency (%)
1 25126 55.6%
0 20069 44.4%

index
Numeric

Distinct count 45195
Unique (%) 100.0%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 22606
Minimum 0
Maximum 45210
Zeros (%) 0.0%

Quantile statistics

Minimum 0
5-th percentile 2259.7
Q1 11300
Median 22610
Q3 33912
95-th percentile 42950
Maximum 45210
Range 45210
Interquartile range 22611

Descriptive statistics

Standard deviation 13053
Coef of variation 0.5774
Kurtosis -1.2004
Mean 22606
MAD 11305
Skewness -0.00016898
Sum 1021695847
Variance 170380000
Memory size 353.2 KiB
Value Count Frequency (%)  
2047 1 0.0%
 
36091 1 0.0%
 
40281 1 0.0%
 
38232 1 0.0%
 
11599 1 0.0%
 
9550 1 0.0%
 
15693 1 0.0%
 
13644 1 0.0%
 
3403 1 0.0%
 
1354 1 0.0%
 
Other values (45185) 45185 100.0%
 

Minimum 5 values

Value Count Frequency (%)  
0 1 0.0%
 
1 1 0.0%
 
2 1 0.0%
 
3 1 0.0%
 
4 1 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
45206 1 0.0%
 
45207 1 0.0%
 
45208 1 0.0%
 
45209 1 0.0%
 
45210 1 0.0%
 

job
Numeric

Distinct count 12
Unique (%) 0.0%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 4.3394
Minimum 0
Maximum 11
Zeros (%) 11.4%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 1
Median 4
Q3 7
95-th percentile 9
Maximum 11
Range 11
Interquartile range 6

Descriptive statistics

Standard deviation 3.2728
Coef of variation 0.75421
Kurtosis -1.2704
Mean 4.3394
MAD 2.8003
Skewness 0.26189
Sum 196120
Variance 10.711
Memory size 353.2 KiB
Value Count Frequency (%)  
1 9730 21.5%
 
4 9451 20.9%
 
9 7593 16.8%
 
0 5171 11.4%
 
7 4152 9.2%
 
5 2263 5.0%
 
6 1579 3.5%
 
2 1487 3.3%
 
10 1303 2.9%
 
3 1240 2.7%
 
Other values (2) 1226 2.7%
 

Minimum 5 values

Value Count Frequency (%)  
0 5171 11.4%
 
1 9730 21.5%
 
2 1487 3.3%
 
3 1240 2.7%
 
4 9451 20.9%
 

Maximum 5 values

Value Count Frequency (%)  
7 4152 9.2%
 
8 938 2.1%
 
9 7593 16.8%
 
10 1303 2.9%
 
11 288 0.6%
 

loan
Boolean

Distinct count 2
Unique (%) 0.0%
Missing (%) 0.0%
Missing (n) 0
Mean 0.16028

Value Count Frequency (%)
0 37951 84.0%
1 7244 16.0%

marital
Numeric

Distinct count 3
Unique (%) 0.0%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 1.1676
Minimum 0
Maximum 2
Zeros (%) 11.5%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 1
Median 1
Q3 2
95-th percentile 2
Maximum 2
Range 2
Interquartile range 1

Descriptive statistics

Standard deviation 0.60821
Coef of variation 0.52092
Kurtosis -0.43943
Mean 1.1676
MAD 0.47078
Skewness -0.10264
Sum 52768
Variance 0.36992
Memory size 353.2 KiB
Value Count Frequency (%)  
1 27208 60.2%
 
2 12780 28.3%
 
0 5207 11.5%
 

Minimum 5 values

Value Count Frequency (%)  
0 5207 11.5%
 
1 27208 60.2%
 
2 12780 28.3%
 

Maximum 5 values

Value Count Frequency (%)  
0 5207 11.5%
 
1 27208 60.2%
 
2 12780 28.3%
 

month
Numeric

Distinct count 12
Unique (%) 0.0%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 5.524
Minimum 0
Maximum 11
Zeros (%) 6.5%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 3
Median 6
Q3 8
95-th percentile 9
Maximum 11
Range 11
Interquartile range 5

Descriptive statistics

Standard deviation 3.0066
Coef of variation 0.54427
Kurtosis -0.99412
Mean 5.524
MAD 2.5489
Skewness -0.48083
Sum 249659
Variance 9.0394
Memory size 353.2 KiB
Value Count Frequency (%)  
8 13764 30.5%
 
5 6892 15.2%
 
1 6236 13.8%
 
6 5341 11.8%
 
9 3970 8.8%
 
0 2932 6.5%
 
3 2649 5.9%
 
4 1403 3.1%
 
10 738 1.6%
 
11 579 1.3%
 
Other values (2) 691 1.5%
 

Minimum 5 values

Value Count Frequency (%)  
0 2932 6.5%
 
1 6236 13.8%
 
2 214 0.5%
 
3 2649 5.9%
 
4 1403 3.1%
 

Maximum 5 values

Value Count Frequency (%)  
7 477 1.1%
 
8 13764 30.5%
 
9 3970 8.8%
 
10 738 1.6%
 
11 579 1.3%
 

pdays
Numeric

Distinct count 558
Unique (%) 1.2%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 41.847
Minimum 1
Maximum 871
Zeros (%) 0.0%

Quantile statistics

Minimum 1
5-th percentile 1
Q1 1
Median 1
Q3 1
95-th percentile 317
Maximum 871
Range 870
Interquartile range 0

Descriptive statistics

Standard deviation 99.471
Coef of variation 2.377
Kurtosis 7.0238
Mean 41.847
MAD 67.045
Skewness 2.6272
Sum 1891276
Variance 9894.6
Memory size 353.2 KiB
Value Count Frequency (%)  
1 36953 81.8%
 
182 167 0.4%
 
92 147 0.3%
 
183 126 0.3%
 
91 126 0.3%
 
181 117 0.3%
 
370 99 0.2%
 
184 85 0.2%
 
364 77 0.2%
 
95 74 0.2%
 
Other values (548) 7224 16.0%
 

Minimum 5 values

Value Count Frequency (%)  
1 36953 81.8%
 
2 37 0.1%
 
3 1 0.0%
 
4 2 0.0%
 
5 11 0.0%
 

Maximum 5 values

Value Count Frequency (%)  
838 1 0.0%
 
842 1 0.0%
 
850 1 0.0%
 
854 1 0.0%
 
871 1 0.0%
 

poutcome
Numeric

Distinct count 4
Unique (%) 0.0%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 2.5598
Minimum 0
Maximum 3
Zeros (%) 10.8%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 3
Median 3
Q3 3
95-th percentile 3
Maximum 3
Range 3
Interquartile range 0

Descriptive statistics

Standard deviation 0.9892
Coef of variation 0.38643
Kurtosis 2.1507
Mean 2.5598
MAD 0.71962
Skewness -1.973
Sum 115691
Variance 0.97852
Memory size 353.2 KiB
Value Count Frequency (%)  
3 36943 81.7%
 
0 4901 10.8%
 
1 1840 4.1%
 
2 1511 3.3%
 

Minimum 5 values

Value Count Frequency (%)  
0 4901 10.8%
 
1 1840 4.1%
 
2 1511 3.3%
 
3 36943 81.7%
 

Maximum 5 values

Value Count Frequency (%)  
0 4901 10.8%
 
1 1840 4.1%
 
2 1511 3.3%
 
3 36943 81.7%
 

previous
Numeric

Distinct count 41
Unique (%) 0.1%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 0.58053
Minimum 0
Maximum 275
Zeros (%) 81.7%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 0
Q3 0
95-th percentile 3
Maximum 275
Range 275
Interquartile range 0

Descriptive statistics

Standard deviation 2.3038
Coef of variation 3.9685
Kurtosis 4505.5
Mean 0.58053
MAD 0.94894
Skewness 41.84
Sum 26237
Variance 5.3076
Memory size 353.2 KiB
Value Count Frequency (%)  
0 36938 81.7%
 
1 2772 6.1%
 
2 2106 4.7%
 
3 1142 2.5%
 
4 714 1.6%
 
5 459 1.0%
 
6 277 0.6%
 
7 205 0.5%
 
8 129 0.3%
 
9 92 0.2%
 
Other values (31) 361 0.8%
 

Minimum 5 values

Value Count Frequency (%)  
0 36938 81.7%
 
1 2772 6.1%
 
2 2106 4.7%
 
3 1142 2.5%
 
4 714 1.6%
 

Maximum 5 values

Value Count Frequency (%)  
41 1 0.0%
 
51 1 0.0%
 
55 1 0.0%
 
58 1 0.0%
 
275 1 0.0%
 

Correlations

Sample

age job marital education default balance housing loan contact day month campaign pdays previous poutcome Target
0 58 4 1 2 0 2143 1 0 2 5 8 1 1 0 3 0
1 44 9 2 1 0 29 1 0 2 5 8 1 1 0 3 0
2 33 2 1 1 0 2 1 1 2 5 8 1 1 0 3 0
3 47 1 1 3 0 1506 1 0 2 5 8 1 1 0 3 0
4 33 11 2 3 0 1 0 0 2 5 8 1 1 0 3 0
Inferences:
  • The inferences below are also supported by the pandas-profiling output.

The dataset is highly imbalanced, and all the variables appear to be weak learners based on their correlation values. Only 11.7% of the Target values are Yes.

  1. The majority of customers are between 20 and 60 years of age; this could be related to the retirement age.
  2. balance has records with negative entries, likely overdraft or loan accounts. There are some outliers, which we will try to remove.
  3. Customers were contacted most often around the 20th/21st of the month.
  4. The majority of customers were contacted via cellular phone.
  5. 51% of customers have secondary education.
  6. There is a wide variety of jobs; blue-collar jobs are the most frequent in the dataset.
  7. 60% of the customers in the given data are married.
  8. The majority of customer contacts happened in the month of May.
  9. The majority of previous campaign outcomes (poutcome) are unknown.
  10. previous is highly skewed (γ1 = 41.846).
  11. The dataset had 16 duplicate rows after removal of the duration column.
In [24]:
plt.title('Target Distribution')
ax=sns.countplot(data = bank_df, x= 'Target',palette='inferno') 
for p in ax.patches: 
    ax.annotate(str((np.round(p.get_height()/len(bank_df)*100,decimals=2)))+'%', (p.get_x()+p.get_width()/2., p.get_height()), ha='center', va='center', xytext=(0, 10), textcoords='offset points')

categorical_variables = ['marital', 'education','default','housing','loan','contact','poutcome','month','job']

plt.figure(figsize=(60,10))
for i in range(len(categorical_variables)):
    plt.subplot(1, 9, i+1)
    ax = sns.countplot(data = bank_df, palette='inferno', x= categorical_variables[i])
    for p in ax.patches:
        ax.annotate(str((np.round(p.get_height()/len(bank_df)*100,decimals=2)))+'%', (p.get_x()+p.get_width()/2., p.get_height()), ha='center', va='center', xytext=(0, 10), textcoords='offset points')


plt.figure(figsize=(60,10))
for i in range(len(categorical_variables)):
    plt.subplot(1, 9, i+1)
    ax = sns.countplot(data = bank_df, x= categorical_variables[i], palette='inferno', hue = 'Target')
    for p in ax.patches:
        ax.annotate(str((np.round(p.get_height()/len(bank_df)*100,decimals=2)))+'%', (p.get_x()+p.get_width()/2., p.get_height()), ha='center', va='center', xytext=(0, 10), textcoords='offset points')

plt.figure(figsize=(10,4))
sns.barplot(bank_df2['job'].value_counts().values, bank_df2['job'].value_counts().index)
plt.title('job')
plt.tight_layout()
Inferences:
  • The inferences below are also supported by the pandas-profiling output.

  1. The dataset has very few records where the target is marked Yes (only 11.7%).
  2. A high number of married customers have opted for a term deposit.
  3. A high number of customers with secondary education opted for a term deposit.
  4. Non-defaulters hold a higher percentage of term deposits.
  5. Customers with no housing loan hold more term deposits.
  6. Customers contacted via cellular phone hold more term deposits.
  7. Most term deposits were taken in the month of May.
  8. People in management positions hold more term deposits.
  9. Blue-collar workers are the most numerous in the given dataset.
  10. There seems to be a drop in balance after age 60.

Normalizing the Data

  • Outliers

Trying to remove the balance outliers with the IQR rule is not helpful, as it removes valid data. Instead, some of the outliers are trimmed using the z-score with a threshold of 3. This removes 745 rows, close to 2% of the data, which is within permissible limits. We do not trim further: the dataset is already highly imbalanced, and removing more data would hurt the predictive power of the model.

In [145]:
#import z statistic library
from scipy import stats

bank_df['balance_z'] = np.abs(stats.zscore(bank_df.balance))
bank_df=bank_df[bank_df.balance_z <3]
  • Remove unwanted columns
    Based on the above step, we drop the helper column balance_z.
In [146]:
bank_df.drop('balance_z', axis = 1, inplace=True)
In [147]:
bank_df.shape
Out[147]:
(44450, 16)
In [148]:
# correlation on the given dataset
corr = bank_df.corr()
corr.style.background_gradient(cmap='coolwarm')
Out[148]:
age job marital education default balance housing loan contact day month campaign pdays previous poutcome Target
age 1 -0.0212915 -0.402511 -0.107906 -0.0172308 0.104814 -0.183881 -0.0138414 0.0268168 -0.00851462 -0.0427404 0.00560666 -0.0238788 0.00112105 0.00726906 0.0246699
job -0.0212915 1 0.0614741 0.166598 -0.00676924 0.0237693 -0.126738 -0.0324384 -0.0817585 0.0226801 -0.0935328 0.00762933 -0.0247756 -0.00149045 0.0116864 0.0398955
marital -0.402511 0.0614741 1 0.109227 -0.00716488 0.00615874 -0.0160977 -0.0462285 -0.0391954 -0.00581688 -0.00661334 -0.010051 0.0191312 0.0148948 -0.0170023 0.0450142
education -0.107906 0.166598 0.109227 1 -0.00989407 0.0533001 -0.0907099 -0.0472705 -0.11024 0.0226284 -0.0588634 0.00676742 -0.000918769 0.0171298 -0.0188384 0.0656319
default -0.0172308 -0.00676924 -0.00716488 -0.00989407 1 -0.0978026 -0.00670588 0.0761787 0.0155268 0.00969652 0.0116192 0.016868 -0.0300713 -0.018336 0.0350933 -0.0222727
balance 0.104814 0.0237693 0.00615874 0.0533001 -0.0978026 1 -0.0673971 -0.101819 -0.0322571 0.0105426 0.0232353 -0.0233535 0.0126636 0.0297823 -0.041691 0.0737594
housing -0.183881 -0.126738 -0.0160977 -0.0907099 -0.00670588 -0.0673971 1 0.0394599 0.189118 -0.0291133 0.27294 -0.0232304 0.125472 0.0372654 -0.100228 -0.138913
loan -0.0138414 -0.0324384 -0.0462285 -0.0472705 0.0761787 -0.101819 0.0394599 1 -0.0116121 0.0120939 0.0226674 0.0102279 -0.0226589 -0.0110205 0.015134 -0.0679293
contact 0.0268168 -0.0817585 -0.0391954 -0.11024 0.0155268 -0.0322571 0.189118 -0.0116121 1 -0.0279288 0.364108 0.0192746 -0.244626 -0.1475 0.27273 -0.148001
day -0.00851462 0.0226801 -0.00581688 0.0226284 0.00969652 0.0105426 -0.0291133 0.0120939 -0.0279288 1 -0.00910897 0.163742 -0.094051 -0.0520433 0.0845149 -0.0292021
month -0.0427404 -0.0935328 -0.00661334 -0.0588634 0.0116192 0.0232353 0.27294 0.0226674 0.364108 -0.00910897 1 -0.108852 0.0335276 0.0230326 -0.0335514 -0.0233314
campaign 0.00560666 0.00762933 -0.010051 0.00676742 0.016868 -0.0233535 -0.0232304 0.0102279 0.0192746 0.163742 -0.108852 1 -0.0887968 -0.0324218 0.102007 -0.0734474
pdays -0.0238788 -0.0247756 0.0191312 -0.000918769 -0.0300713 0.0126636 0.125472 -0.0226589 -0.244626 -0.094051 0.0335276 -0.0887968 1 0.452673 -0.857547 0.102271
previous 0.00112105 -0.00149045 0.0148948 0.0171298 -0.018336 0.0297823 0.0372654 -0.0110205 -0.1475 -0.0520433 0.0230326 -0.0324218 0.452673 1 -0.488288 0.0925213
poutcome 0.00726906 0.0116864 -0.0170023 -0.0188384 0.0350933 -0.041691 -0.100228 0.015134 0.27273 0.0845149 -0.0335514 0.102007 -0.857547 -0.488288 1 -0.0773249
Target 0.0246699 0.0398955 0.0450142 0.0656319 -0.0222727 0.0737594 -0.138913 -0.0679293 -0.148001 -0.0292021 -0.0233314 -0.0734474 0.102271 0.0925213 -0.0773249 1
In [149]:
sns.pairplot(bank_df)
Out[149]:
<seaborn.axisgrid.PairGrid at 0x1b91327ff28>
In [150]:
plt.figure(figsize=(30, 6))
plt.subplot(1,6,1)
plt.title('Age vs Target')
sns.boxplot(x = 'age', y = 'Target',orient="horizontal",data = bank_df)
plt.subplot(1,6,2)
plt.title('Balance vs Target')
sns.boxplot(x = 'balance', y = 'Target',orient="horizontal", data = bank_df)
Out[150]:
<matplotlib.axes._subplots.AxesSubplot at 0x1b92d2ed278>
Inferences:

In spite of removing the outliers, we could not notice much improvement, and the correlation matrix above suggests only very weak relationships with the target. Although these are weak predictors, we will make use of them via ensemble learning to improve the predictions.

Positively correlated columns

  • pdays
  • previous
  • balance
  • education
  • marital
  • job
  • age

Negatively correlated columns

  • default
  • month
  • day
  • loan
  • campaign
  • poutcome
  • housing
  • contact

Split the Target column from the independent columns

Although several columns are only weakly (or negatively) correlated with the target, we do not remove them: weak predictors still contribute to decision trees and ensemble methods.
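As a quick sanity check, the feature-to-target correlations can be ranked by absolute value before deciding what to keep. A minimal sketch on a small synthetic frame (with the real data, the label-encoded bank_df would take the place of `df`; the label rule below is purely illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(18, 90, 200),
    "balance": rng.normal(1000, 500, 200),
})
df["Target"] = (df["age"] > 60).astype(int)  # hypothetical label for illustration

# Rank features by |correlation| with the target; weak ones are kept, not dropped
corr_with_target = (
    df.corr()["Target"].drop("Target").abs().sort_values(ascending=False)
)
print(corr_with_target)
```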

In [151]:
#Train and Test data split library import
from sklearn.model_selection import train_test_split 
In [152]:
X = bank_df.drop(['Target'], axis=1)  # independent variables
y = bank_df['Target']  # dependent variable

Split the data into a 70:30 train:test ratio

In [153]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10)
In [154]:
print("Original Target True Values    : {0} ({1:0.2f}%)".format(len(bank_df.loc[bank_df['Target'] == 1]), (len(bank_df.loc[bank_df['Target'] == 1])/len(bank_df.index)) * 100))
print("Original Target False Values   : {0} ({1:0.2f}%)".format(len(bank_df.loc[bank_df['Target'] == 0]), (len(bank_df.loc[bank_df['Target'] == 0])/len(bank_df.index)) * 100))
print("")
print("Training Target True Values    : {0} ({1:0.2f}%)".format(len(y_train[y_train[:] == 1]), (len(y_train[y_train[:] == 1])/len(y_train)) * 100))
print("Training Target False Values   : {0} ({1:0.2f}%)".format(len(y_train[y_train[:] == 0]), (len(y_train[y_train[:] == 0])/len(y_train)) * 100))
print("")
print("Test Target True Values        : {0} ({1:0.2f}%)".format(len(y_test[y_test[:] == 1]), (len(y_test[y_test[:] == 1])/len(y_test)) * 100))
print("Test Target False Values       : {0} ({1:0.2f}%)".format(len(y_test[y_test[:] == 0]), (len(y_test[y_test[:] == 0])/len(y_test)) * 100))
print("")
Original Target True Values    : 5168 (11.63%)
Original Target False Values   : 39282 (88.37%)

Training Target True Values    : 3629 (11.66%)
Training Target False Values   : 27486 (88.34%)

Test Target True Values        : 1539 (11.54%)
Test Target False Values       : 11796 (88.46%)
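The split above happens to preserve the class ratio closely; passing `stratify=y` to `train_test_split` guarantees it. A minimal sketch with synthetic labels at roughly the bank data's 11.7% positive rate:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: ~11.7% positives, mirroring the bank data's imbalance
y = np.array([1] * 117 + [0] * 883)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y keeps the positive rate identical in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=10
)
print(y_tr.mean(), y_te.mean())  # both close to 0.117
```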

Let's build and evaluate the models

Logistic Regression

In [155]:
#Import required metrics libraries
from sklearn import metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

#Logistic regression model import
from sklearn.linear_model import LogisticRegression

#importing cross validation and Grid search library
from sklearn.model_selection import GridSearchCV

Hyperparameter tuning with GridSearchCV

In [47]:
grid={"C":np.logspace(-3,3,7), "penalty":["l1","l2"]}  # l1 = lasso, l2 = ridge; note the default lbfgs solver supports only l2, so l1 fits are scored as NaN
model=LogisticRegression(max_iter=50000)
model_cv=GridSearchCV(model,grid,cv=10)
model_cv.fit(X_train,y_train)

print("Tuned hyperparameters (best parameters) :",model_cv.best_params_)
print("Accuracy :",model_cv.best_score_)
Tuned hyperparameters (best parameters) : {'C': 1000.0, 'penalty': 'l2'}
Accuracy : 0.8824951687946638
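Since recall is the headline metric here, GridSearchCV can also optimize it directly via `scoring="recall"`; `class_weight="balanced"` is one common hedge against the 88/12 imbalance. A sketch on synthetic data (not the tuning actually run above):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data standing in for the bank features
X, y = make_classification(n_samples=400, weights=[0.88], random_state=0)

grid = {"C": np.logspace(-2, 2, 5)}
search = GridSearchCV(
    LogisticRegression(max_iter=5000, class_weight="balanced"),
    grid, cv=5, scoring="recall",  # select C on recall, not accuracy
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```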
In [156]:
# Fit the model with the help of hyper parameters

model2=LogisticRegression(C=1000,penalty="l2")
#model2=LogisticRegression()
model2.fit(X_train,y_train)
Out[156]:
LogisticRegression(C=1000, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
In [157]:
#predict on test
y_pred = model2.predict(X_test)

print('\033[91m' + "Logistic Regression Stats" + '\033[0m')
print(classification_report(y_test,y_pred))
print("Recall : ",round(recall_score(y_test,y_pred),2))
print("Accuracy : ",round(accuracy_score(y_test,y_pred),2))
print("Precision : ",round(precision_score(y_test,y_pred),2))

print('\033[91m' + "Logistic Regression Confusion Matrix" + '\033[0m')
#print(confusion_matrix(y_test,y_pred))
mcm=confusion_matrix(y_test, y_pred, labels=[1, 0])

log_df_cm = pd.DataFrame(mcm, index = [i for i in ["1","0"]],
                  columns = [i for i in ["Predict 1","Predict 0"]])
log_df_cm

#coef_df = pd.DataFrame(model.coef_)
#coef_df['intercept'] = model.intercept_
#print(coef_df)
Logistic Regression Stats
              precision    recall  f1-score   support

           0       0.88      1.00      0.94     11796
           1       0.27      0.00      0.00      1539

    accuracy                           0.88     13335
   macro avg       0.58      0.50      0.47     13335
weighted avg       0.81      0.88      0.83     13335

Recall :  0.0
Accuracy :  0.88
Precision :  0.27
Logistic Regression Confusion Matrix
Out[157]:
Predict 1 Predict 0
1 3 1536
0 8 11788

ROC/AUC Curve

In [158]:
log_roc_auc = roc_auc_score(y_test, model2.predict(X_test))
fpr, tpr, thresholds = roc_curve(y_test, model2.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % log_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()

Naive Bayes

In [159]:
# To model the Gaussian Naive Bayes classifier
from sklearn.naive_bayes import GaussianNB
In [160]:
nb_clf = GaussianNB()
nb_clf.fit(X_train, y_train)
Out[160]:
GaussianNB(priors=None, var_smoothing=1e-09)
In [161]:
y_pred = nb_clf.predict(X_test)

print('\033[91m' + "Naive Bayes Stats" + '\033[0m')
print(classification_report(y_test,y_pred))
print("Recall : ",round(recall_score(y_test,y_pred),2))
print("Accuracy : ",round(accuracy_score(y_test,y_pred),2))
print("Precision : ",round(precision_score(y_test,y_pred),2))

print('\033[91m' + "Naive Bayes Confusion Matrix" + '\033[0m')
#print(confusion_matrix(y_test,y_pred))
nv_mcm=confusion_matrix(y_test, y_pred, labels=[1, 0])

nv_df_cm = pd.DataFrame(nv_mcm, index = [i for i in ["1","0"]],
                  columns = [i for i in ["Predict 1","Predict 0"]])
nv_df_cm
Naive Bayes Stats
              precision    recall  f1-score   support

           0       0.91      0.86      0.89     11796
           1       0.25      0.35      0.29      1539

    accuracy                           0.80     13335
   macro avg       0.58      0.61      0.59     13335
weighted avg       0.83      0.80      0.82     13335

Recall :  0.35
Accuracy :  0.8
Precision :  0.25
Naive Bayes Confusion Matrix
Out[161]:
Predict 1 Predict 0
1 542 997
0 1619 10177

ROC/AUC Curve

In [163]:
NB_roc_auc = roc_auc_score(y_test,nb_clf.predict(X_test))
fpr, tpr, thresholds = roc_curve(y_test, nb_clf.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='NB Classifier (area = %0.2f)' % NB_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('NB_ROC')
plt.show()

KNN-Classification

With KNN we classify a point based on its nearest neighbours. There is no fixed rule for choosing k; common practice is to start with small odd values such as 3, 5, and 7. Since the accuracy score dropped for k = 5 and k = 7, we confine the final model to k = 3.
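The comparison of k values described above can be sketched as a small scan, scored on recall since that is our target metric (synthetic data for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import recall_score

# Synthetic imbalanced data standing in for the bank features
X, y = make_classification(n_samples=600, weights=[0.88], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

# Score each candidate k on test-set recall
scores = {}
for k in (3, 5, 7, 9):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    scores[k] = recall_score(y_te, knn.predict(X_te))
print(scores)
```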

In [164]:
# loading library
from sklearn.neighbors import KNeighborsClassifier
In [165]:
# instantiate learning model (k = 3)
knn = KNeighborsClassifier(n_neighbors = 3)

# fitting the model
knn.fit(X_train, y_train)
# predict the response
y_pred = knn.predict(X_test)
In [166]:
print('\033[91m' + "KNN Stats" + '\033[0m')
print(classification_report(y_test,y_pred))
print("Recall : ",round(recall_score(y_test,y_pred),2))
print("Accuracy : ",round(accuracy_score(y_test,y_pred),2))
print("Precision : ",round(precision_score(y_test,y_pred),2))

print('\033[91m' + "KNN Confusion Matrix" + '\033[0m')
#print(confusion_matrix(y_test,y_pred))
knn_mcm=confusion_matrix(y_test, y_pred, labels=[1, 0])

knn_df_cm = pd.DataFrame(knn_mcm, index = [i for i in ["1","0"]],
                  columns = [i for i in ["Predict 1","Predict 0"]])
knn_df_cm
KNN Stats
              precision    recall  f1-score   support

           0       0.90      0.96      0.93     11796
           1       0.33      0.14      0.20      1539

    accuracy                           0.87     13335
   macro avg       0.61      0.55      0.56     13335
weighted avg       0.83      0.87      0.84     13335

Recall :  0.14
Accuracy :  0.87
Precision :  0.33
KNN Confusion Matrix
Out[166]:
Predict 1 Predict 0
1 220 1319
0 442 11354
In [167]:
KNN_roc_auc = roc_auc_score(y_test,knn.predict(X_test))
fpr, tpr, thresholds = roc_curve(y_test, knn.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='KNN Classifier (area = %0.2f)' % KNN_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('KNN_ROC')
plt.show()

Decision Tree

In [168]:
from sklearn.tree import DecisionTreeClassifier
#from sklearn.feature_extraction.text import CountVectorizer  #DT does not take strings as input for the model fit step....
from IPython.display import Image  
#import pydotplus as pydot
from sklearn import tree
from os import system
In [169]:
from graphviz import Source
In [170]:
dTree = DecisionTreeClassifier(criterion = 'entropy')
dTree.fit(X_train, y_train)
Out[170]:
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='entropy',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')
In [171]:
print(dTree.score(X_train, y_train))
print(dTree.score(X_test, y_test))
1.0
0.833970753655793

Decision trees are prone to overfitting, as seen here (train accuracy 1.0 vs. test accuracy 0.83). Hence we prune the tree; this is covered below under regularization.
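One way to see the overfitting and pick a depth is to compare train vs. test accuracy across depths; a shrinking gap signals less overfitting. A sketch on synthetic data (the depth of 7 used in the regularized tree was chosen in this spirit):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the bank features
X, y = make_classification(n_samples=800, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# Unrestricted depth memorizes the training set; shallow trees generalize better
for depth in (3, 7, None):
    t = DecisionTreeClassifier(criterion="entropy", max_depth=depth,
                               random_state=1).fit(X_tr, y_tr)
    print(depth, round(t.score(X_tr, y_tr), 3), round(t.score(X_te, y_te), 3))
```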

In [172]:
train_char_label = ['0', '1']  # class_names must follow the ascending order of dTree.classes_
Credit_Tree_File = open('credit_tree.dot','w')
dot_data = tree.export_graphviz(dTree, out_file=Credit_Tree_File, feature_names = list(X_train), class_names = list(train_char_label))
Credit_Tree_File.close()
In [174]:
retCode = system("dot -Tpng credit_tree.dot -o credit_tree.png")
if(retCode>0):
    print("system command returning error: "+str(retCode))
else:
    display(Image("credit_tree.png"))
In [204]:
# Importance of features in the tree building. (The importance of a feature is
# computed as the (normalized) total reduction of the criterion brought by that
# feature; it is also known as the Gini importance.)

print(pd.DataFrame(dTree.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values("Imp", ascending= False))
                Imp
balance    0.236831
age        0.148846
day        0.127088
month      0.087488
campaign   0.068933
job        0.067702
pdays      0.061334
contact    0.044555
poutcome   0.042421
education  0.037424
housing    0.024167
marital    0.022563
loan       0.014049
previous   0.014037
default    0.002561
In [176]:
y_pred = dTree.predict(X_test)

print('\033[91m' + "Decision Tree Stats" + '\033[0m')
print(classification_report(y_test,y_pred))
print("Recall : ",round(recall_score(y_test,y_pred),2))
print("Accuracy : ",round(accuracy_score(y_test,y_pred),2))
print("Precision : ",round(precision_score(y_test,y_pred),2))

print('\033[91m' + "Decision Tree Confusion Matrix" + '\033[0m')
#print(confusion_matrix(y_test,y_pred))
dt_mcm=confusion_matrix(y_test, y_pred, labels=[1, 0])

dt_df_cm = pd.DataFrame(dt_mcm, index = [i for i in ["1","0"]],
                  columns = [i for i in ["Predict 1","Predict 0"]])
dt_df_cm
Decision Tree Stats
              precision    recall  f1-score   support

           0       0.91      0.90      0.91     11796
           1       0.30      0.33      0.32      1539

    accuracy                           0.83     13335
   macro avg       0.61      0.62      0.61     13335
weighted avg       0.84      0.83      0.84     13335

Recall :  0.33
Accuracy :  0.83
Precision :  0.3
Decision Tree Confusion Matrix
Out[176]:
Predict 1 Predict 0
1 514 1025
0 1189 10607

ROC/AUC Curve

In [177]:
DT_roc_auc = roc_auc_score(y_test,dTree.predict(X_test))
fpr, tpr, thresholds = roc_curve(y_test, dTree.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='DTree Classifier (area = %0.2f)' % DT_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic') 
plt.legend(loc="lower right")
plt.savefig('DT_ROC')
plt.show()

Regularized Decision Tree

Let's restrict the depth to 7 and rerun the decision tree

In [178]:
reg_dt_model = DecisionTreeClassifier(criterion = 'entropy', max_depth = 7)
reg_dt_model.fit(X_train, y_train)
Out[178]:
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='entropy',
                       max_depth=7, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')
In [206]:
print(reg_dt_model.score(X_train, y_train))
print(reg_dt_model.score(X_test, y_test))
0.8966414912421662
0.8948631421072366
In [207]:
train_char_label = ['0', '1']  # class_names must follow the ascending order of reg_dt_model.classes_
reg_dt_model_File = open('reg_dt_model.dot','w')
dot_data = tree.export_graphviz(reg_dt_model, out_file=reg_dt_model_File, feature_names = list(X_train), class_names = list(train_char_label))
reg_dt_model_File.close()
In [208]:
retCode = system("dot -Tpng reg_dt_model.dot -o reg_dt_model_File.png")
if(retCode>0):
    print("system command returning error: "+str(retCode))
else:
    display(Image("reg_dt_model_File.png"))
In [205]:
# Importance of features in the tree building. (The importance of a feature is
# computed as the (normalized) total reduction of the criterion brought by that
# feature; it is also known as the Gini importance.)

print(pd.DataFrame(reg_dt_model.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values("Imp", ascending= False))
                Imp
poutcome   0.212633
contact    0.199041
month      0.154775
age        0.120007
pdays      0.116148
housing    0.061687
balance    0.046497
marital    0.036825
day        0.031762
campaign   0.008774
education  0.007323
previous   0.003061
job        0.001466
default    0.000000
loan       0.000000
In [214]:
y_pred = reg_dt_model.predict(X_test)

print('\033[91m' + "Regularized Decision Tree Stats" + '\033[0m')
print(classification_report(y_test,y_pred))
print("Recall : ",round(recall_score(y_test,y_pred),2))
print("Accuracy : ",round(accuracy_score(y_test,y_pred),2))
print("Precision : ",round(precision_score(y_test,y_pred),2))

print('\033[91m' + "Regularized Decision Tree Confusion Matrix" + '\033[0m')
#print(confusion_matrix(y_test,y_pred))
reg_dt_mcm=confusion_matrix(y_test, y_pred, labels=[1, 0])

reg_dt_df_cm = pd.DataFrame(reg_dt_mcm, index = [i for i in ["1","0"]],
                  columns = [i for i in ["Predict 1","Predict 0"]])
reg_dt_df_cm
Regularized Decision Tree Stats
              precision    recall  f1-score   support

           0       0.90      0.99      0.94     11796
           1       0.66      0.18      0.29      1539

    accuracy                           0.89     13335
   macro avg       0.78      0.59      0.61     13335
weighted avg       0.87      0.89      0.87     13335

Recall :  0.18
Accuracy :  0.89
Precision :  0.66
Regularized Decision Tree Confusion Matrix
Out[214]:
Predict 1 Predict 0
1 281 1258
0 144 11652
In [44]:
DT_roc_auc = roc_auc_score(y_test,reg_dt_model.predict(X_test))
fpr, tpr, thresholds = roc_curve(y_test, reg_dt_model.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Regularized DTree Classifier (area = %0.2f)' % DT_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic') 
plt.legend(loc="lower right")
plt.savefig('Reg_DT_ROC')
plt.show()

Ensemble Learning - Bagging

Bagging with the decision tree as base model

In [184]:
from sklearn.ensemble import BaggingClassifier

bgcl = BaggingClassifier(base_estimator=dTree, n_estimators=50)

#bgcl = BaggingClassifier(n_estimators=50)
bgcl = bgcl.fit(X_train, y_train)
In [185]:
print(bgcl.score(X_train, y_train))
print(bgcl.score(X_test, y_test))
0.9993250843644544
0.8948631421072366
In [186]:
y_pred = bgcl.predict(X_test)

print('\033[91m' + "Bagging Stats" + '\033[0m')
print(classification_report(y_test,y_pred))
print("Recall : ",round(recall_score(y_test,y_pred),2))
print("Accuracy : ",round(accuracy_score(y_test,y_pred),2))
print("Precision : ",round(precision_score(y_test,y_pred),2))

print('\033[91m' + "Bagging Confusion Matrix" + '\033[0m')
#print(confusion_matrix(y_test,y_pred))
bagg_mcm=confusion_matrix(y_test, y_pred, labels=[1, 0])

bagg_mcm_df_cm = pd.DataFrame(bagg_mcm, index = [i for i in ["1","0"]],
                  columns = [i for i in ["Predict 1","Predict 0"]])
bagg_mcm_df_cm
Bagging Stats
              precision    recall  f1-score   support

           0       0.91      0.98      0.94     11796
           1       0.61      0.25      0.35      1539

    accuracy                           0.89     13335
   macro avg       0.76      0.61      0.65     13335
weighted avg       0.87      0.89      0.87     13335

Recall :  0.25
Accuracy :  0.89
Precision :  0.61
Bagging Confusion Matrix
Out[186]:
Predict 1 Predict 0
1 382 1157
0 245 11551
In [187]:
bgcl_roc_auc = roc_auc_score(y_test,bgcl.predict(X_test))
fpr, tpr, thresholds = roc_curve(y_test, bgcl.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Bagging Classifier (area = %0.2f)' % bgcl_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic') 
plt.legend(loc="lower right")
plt.savefig('bgcl_ROC')
plt.show()

Ensemble Learning - AdaBoosting

In [213]:
from sklearn.ensemble import AdaBoostClassifier
abcl = AdaBoostClassifier(base_estimator=dTree, n_estimators=10)
#abcl = AdaBoostClassifier( n_estimators=50)
abcl = abcl.fit(X_train, y_train)
In [189]:
print(abcl.score(X_train, y_train))
print(abcl.score(X_test, y_test))
1.0
0.8372703412073491
In [191]:
y_pred = abcl.predict(X_test)

print('\033[91m' + "Adaboost Stats" + '\033[0m')
print(classification_report(y_test,y_pred))
print("Recall : ",round(recall_score(y_test,y_pred),2))
print("Accuracy : ",round(accuracy_score(y_test,y_pred),2))
print("Precision : ",round(precision_score(y_test,y_pred),2))

print('\033[91m' + "Adaboost Confusion Matrix" + '\033[0m')
#print(confusion_matrix(y_test,y_pred))
abcl_mcm=confusion_matrix(y_test, y_pred, labels=[1, 0])

abcl_mcm_df_cm = pd.DataFrame(abcl_mcm, index = [i for i in ["1","0"]],
                  columns = [i for i in ["Predict 1","Predict 0"]])
abcl_mcm_df_cm
Adaboost Stats
              precision    recall  f1-score   support

           0       0.91      0.90      0.91     11796
           1       0.31      0.33      0.32      1539

    accuracy                           0.84     13335
   macro avg       0.61      0.62      0.61     13335
weighted avg       0.84      0.84      0.84     13335

Recall :  0.33
Accuracy :  0.84
Precision :  0.31
Adaboost Confusion Matrix
Out[191]:
Predict 1 Predict 0
1 511 1028
0 1142 10654
In [192]:
abcl_roc_auc = roc_auc_score(y_test,abcl.predict(X_test))
fpr, tpr, thresholds = roc_curve(y_test, abcl.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Adaboost Classifier (area = %0.2f)' % abcl_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic') 
plt.legend(loc="lower right")
plt.savefig('abcl_ROC')
plt.show()

Ensemble Learning - GradientBoost

In [203]:
from sklearn.ensemble import GradientBoostingClassifier
gbcl = GradientBoostingClassifier(n_estimators = 50)
gbcl = gbcl.fit(X_train, y_train)
In [194]:
print(gbcl.score(X_train, y_train))
print(gbcl.score(X_test, y_test))
0.8934276072633778
0.8943382077240345
In [195]:
y_pred = gbcl.predict(X_test)

print('\033[91m' + "Gradient Boost Stats" + '\033[0m')
print(classification_report(y_test,y_pred))
print("Recall : ",round(recall_score(y_test,y_pred),2))
print("Accuracy : ",round(accuracy_score(y_test,y_pred),2))
print("Precision : ",round(precision_score(y_test,y_pred),2))

print('\033[91m' + "Gradient Boost Confusion Matrix" + '\033[0m')
#print(confusion_matrix(y_test,y_pred))
gbcl_mcm=confusion_matrix(y_test, y_pred, labels=[1, 0])

gbcl_mcm_df_cm = pd.DataFrame(gbcl_mcm, index = [i for i in ["1","0"]],
                  columns = [i for i in ["Predict 1","Predict 0"]])
gbcl_mcm_df_cm
Gradient Boost Stats
              precision    recall  f1-score   support

           0       0.90      0.99      0.94     11796
           1       0.69      0.15      0.25      1539

    accuracy                           0.89     13335
   macro avg       0.80      0.57      0.60     13335
weighted avg       0.88      0.89      0.86     13335

Recall :  0.15
Accuracy :  0.89
Precision :  0.69
Gradient Boost Confusion Matrix
Out[195]:
Predict 1 Predict 0
1 233 1306
0 103 11693
In [196]:
GB_roc_auc = roc_auc_score(y_test,gbcl.predict(X_test))
fpr, tpr, thresholds = roc_curve(y_test, gbcl.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Gradient Boost Classifier (area = %0.2f)' % GB_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic') 
plt.legend(loc="lower right")
plt.savefig('GB_ROC')
plt.show()

Ensemble Learning - RandomForest

In [199]:
from sklearn.ensemble import RandomForestClassifier
rfcl = RandomForestClassifier(n_estimators = 50)
rfcl = rfcl.fit(X_train, y_train)
In [200]:
print(rfcl.score(X_train, y_train))
print(rfcl.score(X_test, y_test))
0.9991643901655151
0.8936632920884889
In [201]:
y_pred = rfcl.predict(X_test)

print('\033[91m' + "Random Forest Stats" + '\033[0m')
print(classification_report(y_test,y_pred))
print("Recall : ",round(recall_score(y_test,y_pred),2))
print("Accuracy : ",round(accuracy_score(y_test,y_pred),2))
print("Precision : ",round(precision_score(y_test,y_pred),2))

print('\033[91m' + "Random Forest Confusion Matrix" + '\033[0m')
#print(confusion_matrix(y_test,y_pred))
rfcl_mcm=confusion_matrix(y_test, y_pred, labels=[1, 0])

rfcl_mcm_df_cm = pd.DataFrame(rfcl_mcm, index = [i for i in ["1","0"]],
                  columns = [i for i in ["Predict 1","Predict 0"]])
rfcl_mcm_df_cm
Random Forest Stats
              precision    recall  f1-score   support

           0       0.91      0.98      0.94     11796
           1       0.61      0.21      0.31      1539

    accuracy                           0.89     13335
   macro avg       0.76      0.60      0.63     13335
weighted avg       0.87      0.89      0.87     13335

Recall :  0.21
Accuracy :  0.89
Precision :  0.61
Random Forest Confusion Matrix
Out[201]:
Predict 1 Predict 0
1 325 1214
0 204 11592
In [202]:
RF_roc_auc = roc_auc_score(y_test,rfcl.predict(X_test))
fpr, tpr, thresholds = roc_curve(y_test, rfcl.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Random Forest Classifier (area = %0.2f)' % RF_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic') 
plt.legend(loc="lower right")
plt.savefig('RF_ROC')
plt.show()

Conclusions:

The dataset is highly imbalanced and the features have only very weak relationships with the target. Since we restricted ourselves to ensemble methods, none of the weakly correlated features were excluded; feature selection via PCA/LDA techniques could take this further. Recall values remain low overall, but the ensemble methods gave some of the better results.

Based on the ROC/AUC curves, ranking the models by false negatives and recall gives:

AdaBoosting with decision tree (Recall = 0.33) > Bagging with decision tree (Recall = 0.25) > Regularized decision tree with depth 7 (Recall = 0.18)

There is still scope to improve this with feature selection.
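Collecting the recall values reported above into a single frame makes the comparison explicit (values copied from the per-model classification reports):

```python
import pandas as pd

# Recall per model, as reported in the sections above
summary = pd.DataFrame(
    {"Recall": [0.35, 0.33, 0.33, 0.25, 0.21, 0.18, 0.15, 0.14, 0.0]},
    index=["Naive Bayes", "AdaBoost (dTree base)", "Decision Tree",
           "Bagging (dTree base)", "Random Forest",
           "Regularized DT (depth 7)", "Gradient Boost",
           "KNN (k=3)", "Logistic Regression"],
).sort_values("Recall", ascending=False)
print(summary)
```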
